Abstract:
A problem on multicore systems is cache sharing, where the cache occupancy of a program depends on the cache usage of peer programs. An exclusive cache hierarchy, as used on AMD processors, is an effective way to give processor cores a large private cache while still benefiting from a shared cache. The shared cache stores the “victims” (i.e., data evicted from private caches), so performance depends on how the victims of co-run programs interact in the shared cache. This article presents a new metric called the victim footprint (VFP). It is measured once per program, in its solo execution, and can then be combined to compute the performance of any exclusive cache hierarchy, replacing parallel testing with theoretical analysis. The work evaluates the VFP by using it to analyze cache sharing by parallel mixes of sequential programs, comparing the accuracy of the theory to hardware counter results, and measuring the benefit of exclusivity-aware analysis and optimization.
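The following is a minimal sketch of how a per-program locality metric measured in solo runs might be composed to predict occupancy in a shared cache. It is not the paper's VFP model: the closed-form footprint curves, the equal-window composition rule, and the bisection solver are assumptions made for illustration, and a true VFP would be built from victim traffic rather than the raw footprint.

```python
import math

# Hypothetical solo-run footprint curves: the average amount of distinct data
# (in cache blocks) a program touches in a time window of length w.  Real
# curves would be measured per program; these closed forms are made up.
def fp_a(w):
    return min(200_000, 50 * math.sqrt(w))   # program A: large, diffuse working set

def fp_b(w):
    return min(20_000, 0.8 * w)              # program B: small, dense working set

def shared_occupancy(footprints, cache_blocks):
    """Find the window length at which the combined footprints fill the cache,
    then report each program's share (an illustrative composition rule only)."""
    lo, hi = 1.0, 1e12
    for _ in range(200):                     # bisection on the window length
        mid = (lo + hi) / 2
        if sum(fp(mid) for fp in footprints) < cache_blocks:
            lo = mid
        else:
            hi = mid
    w = (lo + hi) / 2
    return [fp(w) for fp in footprints]

# e.g., an 8 MB shared cache with 64-byte blocks
print([round(x) for x in shared_occupancy([fp_a, fp_b], cache_blocks=131_072)])
```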
Abstract:
Data race detection has become an important problem in GPU programming. Previous designs of CPU race-checking tools are mainly task parallel and incur high overhead on GPUs due to access instrumentation, especially when monitoring the many thousands of threads routinely used by GPU programs. This article presents a novel data-parallel solution designed and optimized for the GPU architecture. It includes compiler support and a set of runtime techniques. It uses value-based checking, which detects the races reported in previous work, finds new races, and supports race-free deterministic GPU execution. More importantly, race checking is massively data parallel and does not introduce divergent branching or atomic synchronization. Its slowdown is less than 5× for over half of the tests and 10× on average, which is orders of magnitude more efficient than Nvidia's cuda-memcheck tool and methods that use fine-grained access instrumentation.
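As a rough illustration of the general idea behind value-based checking, and not the article's GPU implementation, the sketch below serializes the same data-parallel region under two different thread orders and flags every memory location whose final value depends on the order. The kernel, the two schedules, and the comparison are assumptions made for the example.

```python
N = 16

def kernel(tid, mem):
    # Hypothetical data-parallel "kernel": threads with the same tid % 4 all
    # write the same slot, so the final value depends on the thread order.
    mem[tid % 4] = tid

def run(order):
    mem = [0] * N
    for tid in order:                   # serialize the threads in the given order
        kernel(tid, mem)
    return mem

m1 = run(list(range(N)))                # one schedule
m2 = run(list(reversed(range(N))))      # a different schedule

# Value-based check: a location whose final value differs between the two
# schedules is schedule dependent, i.e., involved in a race.
races = [i for i, (a, b) in enumerate(zip(m1, m2)) if a != b]
print("racy locations:", races)         # -> [0, 1, 2, 3]
```

Comparing values rather than instrumenting every access is what keeps the per-access overhead low, at the cost of missing a race whose conflicting writes happen to produce the same value.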
Abstract:
Just-in-time (JIT) compilation coupled with code caching is widely used to improve performance in dynamic programming language implementations. These code caches, however, along with the associated profiling data for hot code, consume significant amounts of memory, and creating them incurs extra JIT compilation time. On Android, the current standard JIT compiler and its code caches are not shared among processes; that is, the runtime system maintains a private code cache, and its associated data, for each runtime process. However, applications running on the same platform tend to share multiple libraries in common. Sharing cached code across multiple applications and multiple processes can reduce memory use. It can directly reduce compile time. It can also reduce the cumulative amount of time spent interpreting code. All three of these effects can improve actual runtime performance. In this paper, we describe ShareJIT, a global code cache for JITs that can share code across multiple applications and multiple processes. We implemented ShareJIT in the context of the Android Runtime (ART), a widely used, state-of-the-art system. To increase sharing, our implementation constrains the amount of context that the JIT compiler can use to optimize the code. This exposes a fundamental tradeoff: increased specialization to a single process's context decreases the extent to which the compiled code can be shared. In ShareJIT, we limit some optimizations to increase shareability. To evaluate ShareJIT, we tested 8 popular Android apps in a total of 30 experiments. ShareJIT improved overall performance by 9% on average, while decreasing memory consumption by 16% on average and JIT compilation time by 37% on average.
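The toy model below illustrates the sharing idea rather than ART's actual implementation: compiled code is keyed only by context that is stable across processes (here a hypothetical library name, method name, and optimization level), so two apps that JIT the same library method reuse a single cache entry. All names and the compile stub are invented for the example.

```python
# Toy model of a cross-process code cache keyed by process-independent context.
shared_code_cache = {}        # in a real system this would live in shared memory

def compile_method(library, method, opt_level):
    # Stand-in for JIT compilation; returns a fake "code object".
    return f"<native code for {library}:{method} at O{opt_level}>"

def get_or_compile(library, method, opt_level):
    # The key deliberately excludes process-specific context (profiles, runtime
    # constants), which limits specialization but lets any process that loads
    # the same library reuse the compiled code.
    key = (library, method, opt_level)
    if key not in shared_code_cache:
        shared_code_cache[key] = compile_method(library, method, opt_level)
    return shared_code_cache[key]

# Two different "apps" requesting the same hot library method share one entry.
get_or_compile("okhttp", "RealCall.execute", 1)   # app A: triggers compilation
get_or_compile("okhttp", "RealCall.execute", 1)   # app B: reuses the cached code
print(len(shared_code_cache))                     # -> 1
```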
Abstract:
When renting computing power, fairness and overall performance are both important to customers and service providers. However, strict fairness usually results in poor performance. In this paper, we study this trade-off. In our experiments, equal cache partitioning results in 131% higher miss ratios than optimal partitioning. To balance fairness and performance, we propose two elastic, or movable, cache allocation baselines: the elastic miss ratio baseline (EMB) and the elastic cache space baseline (ECB). Furthermore, we study optimal partitions for each baseline at different levels of elasticity and show that EMB is more effective than ECB. We also classify programs from the SPEC 2006 benchmark suite based on how they benefit or suffer from the elastic baselines, and suggest essential information for customers and service providers to choose a baseline.
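A minimal sketch of the kind of constrained partitioning the baselines describe, under assumptions of our own: two programs share a 16-way cache, their miss-ratio curves are made up, the search is brute force over way-granularity splits, and the constraint loosely mimics EMB by letting each program's miss ratio exceed its equal-share miss ratio by at most a factor of 1 + e.

```python
# Illustrative miss-ratio curves for two hypothetical programs sharing a
# 16-way cache; entry i is the miss ratio when the program gets i ways.
mr_a = [1.0, .60, .45, .38, .33, .30, .28, .27, .26, .25, .25, .24, .24, .24, .23, .23, .23]
mr_b = [1.0, .90, .80, .70, .60, .50, .40, .30, .20, .15, .12, .10, .09, .08, .08, .07, .07]

WAYS = 16
EQUAL = WAYS // 2

def best_partition(elasticity):
    """Minimize the combined miss ratio subject to an EMB-like bound: neither
    program's miss ratio may exceed (1 + elasticity) times its equal-share value."""
    bound_a = (1 + elasticity) * mr_a[EQUAL]
    bound_b = (1 + elasticity) * mr_b[EQUAL]
    best = None
    for wa in range(1, WAYS):
        wb = WAYS - wa
        if mr_a[wa] <= bound_a and mr_b[wb] <= bound_b:
            total = mr_a[wa] + mr_b[wb]
            if best is None or total < best[0]:
                best = (total, wa, wb)
    return best

print(best_partition(0.0))   # strict fairness in miss ratio: the equal split
print(best_partition(0.2))   # 20% elasticity allows a better overall partition
```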
Abstract:
Cache in multicore machines is often shared, and the cache performance depends on how memory accesses belonging to different programs interleave with one another. The full range of performance possibilities includes all possible interleavings, which are too numerous to be studied by experiments for any mix of non-trivial programs.
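To make "too numerous" concrete: two traces of n accesses each can interleave in C(2n, n) ways while preserving each trace's internal order, which already exceeds 10^17 for n = 30. The short computation below, a worked example under that simple counting assumption, shows the growth.

```python
from math import comb

# Number of order-preserving interleavings of two traces of length n.
for n in (10, 20, 30):
    print(n, comb(2 * n, n))
# 10 -> 184756
# 20 -> 137846528820
# 30 -> 118264581564861424
```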
Abstract:
Shared caches are generally optimized to maximize overall throughput, fairness, or both among multiple competing programs. In shared environments and compute clouds, users are often unrelated to each other. In such circumstances, an overall gain in throughput does not justify an individual loss. This paper explores cache management policies that allow conservative sharing to protect the cache occupancy of individual programs, yet enable full cache utilization whenever there is an opportunity to do so. We propose a hardware-based mechanism called cache rationing. Each program is assigned a portion of the shared cache as its ration. The hardware protects the ration so that it cannot be taken away by peer programs while it is in use. A program can exceed its pre-allocated ration, but only if another program has unused space in its own ration. We show that rationing provides good resource protection and full utilization of the shared cache for a variety of co-runs.
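The following toy model mimics the rationing rule described above, with simplifications of our own: a fully associative cache, two programs, and a victim-selection rule that reclaims blocks only from a program running over its ration, or from the requester itself. It is a software illustration, not the paper's hardware mechanism.

```python
from collections import OrderedDict

class RationedCache:
    """Toy, fully associative model of cache rationing (not the hardware design).
    Blocks of a program at or under its ration are protected from peers, but
    spare ration can be borrowed until its owner needs it back."""

    def __init__(self, rations):
        self.rations = rations                  # {program: blocks}
        self.size = sum(rations.values())
        self.lru = OrderedDict()                # (program, addr) -> None, LRU order
        self.occ = {p: 0 for p in rations}

    def access(self, prog, addr):
        key = (prog, addr)
        if key in self.lru:                     # hit: refresh recency
            self.lru.move_to_end(key)
            return True
        if len(self.lru) >= self.size:          # miss with a full cache: evict
            self._evict_for(prog)
        self.lru[key] = None
        self.occ[prog] += 1
        return False

    def _evict_for(self, prog):
        # Reclaim a block from any program that exceeds its ration (such blocks
        # are unprotected); otherwise recycle the requester's own LRU block.
        for (p, a) in self.lru:                 # iterates in LRU order
            if self.occ[p] > self.rations[p] or p == prog:
                del self.lru[(p, a)]
                self.occ[p] -= 1
                return

# Program "a" ramps up first and borrows b's unused ration;
# when "b" starts, it reclaims only a's over-ration blocks.
cache = RationedCache({"a": 4, "b": 4})
for addr in range(8):
    cache.access("a", addr)
for addr in range(4):
    cache.access("b", addr)
print(cache.occ)                                # -> {'a': 4, 'b': 4}
```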
Abstract:
As caches become larger and shared by an increasing number of cores, cache management is becoming more important. This paper explores collaborative caching, which uses software hints to influence hardware caching. Recent studies have shown that such collaboration between software and hardware can theoretically achieve optimal cache replacement on an LRU-like cache. This paper presents Pacman, a practical solution for collaborative caching in loop-based code. Pacman uses profiling to analyze the patterns of an optimal caching policy and determine which data to cache and when. It then splits each loop into parts at compile time. At run time, the loop boundary is adjusted so that the program selectively caches the data that the optimal policy would keep. In this way, Pacman emulates the optimal policy wherever it can. Pacman requires a single hint bit in load and store instructions, for which some current hardware has partial support. This paper presents results from both simulated and real systems, and compares the simulated results with related caching policies.
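The transformation can be pictured with the hypothetical sketch below: a loop over an array larger than the cache is split so that accesses to a cache-sized prefix use normal (LRU) caching while the remainder carries an evict-me style hint. The hint functions, the split rule, and the capacity constant are illustrative stand-ins, not Pacman's actual interface.

```python
# Illustrative loop splitting in the spirit of collaborative caching.
# load_normal / load_evict_me stand in for loads without and with the
# one-bit cache hint; neither is a real API.

CACHE_BLOCKS = 1024          # assumed effective cache capacity, in elements

def load_normal(a, i):       # would be a plain load (default LRU insertion)
    return a[i]

def load_evict_me(a, i):     # would be a load tagged with the evict-me bit
    return a[i]

def sum_with_hints(a):
    # Under optimal replacement, only a cache-sized prefix of a repeatedly
    # scanned array is worth keeping; the split point is chosen at run time
    # from the data size and the cache size.
    split = min(len(a), CACHE_BLOCKS)
    total = 0
    for i in range(split):           # this part is allowed to stay in cache
        total += load_normal(a, i)
    for i in range(split, len(a)):   # this part is marked for early eviction
        total += load_evict_me(a, i)
    return total

print(sum_with_hints(list(range(5000))))
```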
Abstract:
The rise of social media and cloud computing, paired with ever-growing storage capacity, is bringing big data into the limelight, and rightly so. Data, it seems, can be found everywhere; it is harvested from our cars, our pockets, and soon even from our eyeglasses. While researchers in machine learning are developing new techniques to analyze vast quantities of sometimes unstructured data, there is another, not-so-new, form of big data analysis that has been quietly laying the architectural foundations of efficient data usage for decades. Every time a piece of data goes through a processor, it must get there through the memory hierarchy. Since retrieving data from main memory takes hundreds of times longer than accessing it from the cache, a robust theory of data usage can lay the groundwork for all efficient caching. Since everything touched by the CPU is first touched by the cache, the cache traces produced by the analysis of big data will invariably be bigger than big. In this paper, we first summarize the locality problem and its history, then give a view of the present state of the field as it adapts to the industry standards of multicore CPUs and multithreaded programs, before exploring ideas for expanding the theory to other big data domains.